Introduction

In our previous lesson, we created a PBMC object, completed clustering, and performed annotation. This serves as the foundation for all our subsequent analyses. In this lesson, we’ll systematically explore various visualization techniques using SeuratExtend to create high-quality, publication-ready figures.

Let’s start by loading the necessary packages and our previously saved PBMC object:

library(Seurat)
library(SeuratExtend)

# Load the PBMC object we saved at the end of Lesson 2
pbmc <- readRDS("rds/pbmc_annotated.rds")

1. Create an Enhanced Dimensional Reduction Plot

In Seurat, dimension reduction plots such as UMAP are typically created using DimPlot for discrete variables and FeaturePlot for continuous variables. SeuratExtend simplifies this process with DimPlot2, which does not require differentiation between variable types. This function automatically recognizes the type of input parameters, whether discrete or continuous. Additionally, DimPlot2 introduces numerous extra parameters to enrich the customization of the plots.

Basic Usage

To generate a basic dimension reduction plot, simply call DimPlot2 with your Seurat object:

DimPlot2(pbmc)

Visualizing Different Variables

DimPlot2 can handle both discrete and continuous variables seamlessly. Here’s how to input different variables into the plot:

DimPlot2(pbmc, features = c("cluster", "MS4A1", "CD14", "CD3D"))

Splitting by Variable

You can also split the visualization by a specific variable, which is particularly useful for comparative analysis across conditions or identities. In our current PBMC object, we only have one sample. Let’s create a new column in the metadata to simulate a scenario with two samples for demonstration purposes:

set.seed(42)  # For reproducibility
pbmc$sample <- sample(c("sample1", "sample1", "sample2"), size = ncol(pbmc), replace = TRUE)

table(pbmc$sample)  # Check the distribution
## 
## sample1 sample2 
##    1796     904

Now, let’s create a split plot:

DimPlot2(pbmc, features = c("cluster", "CD14"), split.by = "sample", ncol = 1)

Highlighting Specific Cells

To highlight cells of interest, such as a specific cluster, you can define the cells explicitly and use them in your plot:

b_cells <- colnames(pbmc)[pbmc$cluster == "B cell"]
DimPlot2(pbmc, cells.highlight = b_cells)

Advanced Customization

For each variable, you can specify custom colors, adjust themes, and more. For detailed information on color customization, refer to the Explore Color Functions section:

DimPlot2(
  pbmc,
  features = c("cluster", "sample", "CD14", "CD3D"),
  cols = list(
    "cluster" = "light",
    "CD14" = "D",
    "CD3D" = c("#EEEEEE", "black")
  ),
  theme = NoAxes())

Adding Labels and Boxes

To further enhance the plot, you can add labels and bounding boxes to clearly delineate different groups or points of interest:

DimPlot2(pbmc, label = TRUE, box = TRUE, label.color = "black", repel = TRUE, theme = NoLegend())

Simplifying Labels with Indices

Sometimes, cluster names are too lengthy and can make the plot appear cluttered when displayed with labels. To address this, consider using indices to replace the cluster names in the plot, which helps make the visualization cleaner. For instance, you can label clusters as ‘C1’, ‘C2’, etc., on the plot itself, while detailing what each index stands for (e.g., ‘C1: B cell’, ‘C2: CD4 T Memory’) in the figure legend:

DimPlot2(pbmc, index.title = "C", box = TRUE, label.color = "black")

This approach ensures that the plot remains legible and aesthetically pleasing, even when dealing with numerous or complex labels.

2. Simultaneous Display of Three Features on a Dimension Reduction Plot

In SeuratExtend, a unique visualization method allows for the simultaneous display of three features on the same dimension reduction plot. The functions FeaturePlot3 and FeaturePlot3.grid employ a color mixing system (either RYB or RGB) to represent three different genes (or other continuous variables). This method uses the principles of color mixing to quantitatively display the expression levels or intensities of these three features in each cell.

RYB and RGB Color Systems

In the RGB system, black represents no or low expression, and brighter colors indicate higher levels: RGB System Example

In the RYB system, white represents no expression, and deeper colors indicate higher expression levels: RYB System Example

Examples Using RYB and RGB Systems

Here’s how to display three markers using the RYB system, with red for CD3D, yellow for CD14, and blue for CD79A:

FeaturePlot3(pbmc, color = "ryb", feature.1 = "CD3D", feature.2 = "CD14", feature.3 = "CD79A", pt.size = 0.5)

For the RGB system, with red for CD3D, green for CD14, and blue for CD79A:

FeaturePlot3(pbmc, color = "rgb", feature.1 = "CD3D", feature.2 = "CD14", feature.3 = "CD79A", pt.size = 1)

Batch Visualization with FeaturePlot3.grid

FeaturePlot3.grid extends FeaturePlot3 by allowing multiple plots to be generated in one go. The features parameter requires a vector where every three values are assigned a color (RYB or RGB) and placed together in one plot. If you wish to skip a color, use NA as a placeholder.

For instance, to place the following five genes into two plots using the RYB system, and skip yellow in the second plot:

FeaturePlot3.grid(pbmc, features = c("CD3D", "CD14", "CD79A", "FCGR3A", NA, "LYZ"), pt.size = 0.5)

Tips on Point Size

The background is usually white, so the choice of color system and point size can significantly affect visual perception. In the RYB system, where higher expression results in darker colors, a smaller pt.size is preferable to prevent overlapping points. In contrast, in the RGB system, higher expressions result in lighter colors, potentially leading to visibility issues for highly expressed cells that may blend into the white background. Here, a larger pt.size is recommended so that the darker, low-expression points can form a “background” to highlight the lighter, high-expression points.

3. Visualize Cluster Distribution in Samples

Introduction

The ClusterDistrBar function is designed to visualize the distribution of clusters across different samples. It can show both absolute counts and proportions, and it allows for various customizations including axis reversal and normalization.

Basic Usage

To create a basic bar plot showing the distribution of clusters within samples, simply specify the origin (sample identifier) and cluster variables from your dataset:

ClusterDistrBar(origin = pbmc$sample, cluster = pbmc$cluster)

Displaying Absolute Cell Counts

If you prefer to visualize the absolute cell count rather than proportions, set the percent parameter to FALSE:

ClusterDistrBar(origin = pbmc$sample, cluster = pbmc$cluster, percent = FALSE)

Vertical Bar Plot

If a vertical orientation is preferred over the default horizontal bars, set the flip parameter to FALSE:

ClusterDistrBar(origin = pbmc$sample, cluster = pbmc$cluster, flip = FALSE, reverse_order = TRUE)

Non-Stacked Bar Plot

If you prefer not to stack the bars, which can be useful for direct comparisons of cluster sizes across samples, set the stack parameter to FALSE:

ClusterDistrBar(origin = pbmc$sample, cluster = pbmc$cluster, flip = FALSE, stack = FALSE)

Exporting Data Matrix

In cases where a visual plot is not required and only the underlying data matrix is needed, set the plot parameter to FALSE:

data_matrix <- ClusterDistrBar(origin = pbmc$sample, cluster = pbmc$cluster, plot = FALSE)
# View the matrix
print(data_matrix)
##                sample1    sample2
## B cell       12.973274 12.8318584
## CD4 T Memory 18.040089 19.1371681
## CD4 T Naive  25.222717 23.7831858
## CD8 T cell   12.472160 11.8362832
## DC            1.113586  1.7699115
## Mono CD14    18.095768 18.4734513
## Mono FCGR3A   5.679287  6.3053097
## NK cell       6.013363  5.0884956
## Platelet      0.389755  0.7743363

4. Create an Enhanced Violin Plot

While DimPlot uses color gradients to represent expression levels, violin plots offer a more quantitative view of expression distribution. The enhanced VlnPlot2 function in SeuratExtend integrates functionalities to superimpose boxplots, easily add statistical annotations, and offers greater flexibility in plot presentation compared to the original VlnPlot in Seurat.

Usage

The method to employ VlnPlot2 will differ depending on whether you’re using a Seurat object or a matrix. In this lesson, we’ll focus on using a Seurat object.

Basic Violin Plot with Box Plot and Points

Let’s start by selecting the genes we want to analyze. Here’s an example using three genes:

genes <- c("CD3D", "CD14", "CD79A")
VlnPlot2(pbmc, features = genes, ncol = 1)

Hide Points and Outliers for a Cleaner Appearance

VlnPlot2(pbmc, features = genes, pt = FALSE, hide.outlier = TRUE, ncol = 1)

Grouping by Cluster and Splitting Each Cluster by Samples

VlnPlot2(pbmc, features = genes, group.by = "cluster", split.by = "sample")

Filtering for Certain Subtypes and Arranging Plots in Columns

cells <- colnames(pbmc)[pbmc$cluster %in% c("B cell", "Mono CD14", "CD8 T cell")]
VlnPlot2(pbmc, features = genes, group.by = "cluster", cells = cells)

Adding Statistical Annotations Using the Wilcoxon Test

VlnPlot2(pbmc, features = genes, group.by = "cluster", cells = cells, 
         stat.method = "wilcox.test", hide.ns = TRUE)

Restricting Statistical Comparisons and Using t-test

VlnPlot2(pbmc, features = genes, group.by = "cluster", cells = cells, 
         stat.method = "t.test", comparisons = list(c(1,2), c(1,3)), hide.ns = FALSE)

5. Generate a Heatmap Plot

Introduction

The Heatmap function provides a flexible and comprehensive way to visualize matrices, especially those produced by the CalcStats function.

Basic Usage

First, let’s generate a sample matrix using the CalcStats function:

# Assuming pbmc data and VariableFeatures function are available
genes <- VariableFeatures(pbmc)
toplot <- CalcStats(pbmc, features = genes, method = "zscore", order = "p", n = 4)
head(toplot, 10)
##            B cell CD4 T Memory CD4 T Naive CD8 T cell          DC  Mono CD14
## CD79A   2.6617833   -0.3210925  -0.3572368 -0.3804323 -0.17865513 -0.3571722
## CD79B   2.4991391   -0.4657078  -0.5071176 -0.4044852 -0.42079797 -0.5171697
## MS4A1   2.6640982   -0.3491962  -0.3920821 -0.3623228 -0.26088265 -0.3756385
## TCL1A   2.6529328   -0.3658656  -0.3601593 -0.3457438 -0.07058998 -0.3691003
## LTB     1.3010444    1.4217753   0.9983871  0.1222852 -0.68256844 -0.9210437
## CD2    -0.8165695    1.5900673   0.7584343  1.3917165 -0.39990926 -0.8386137
## GIMAP7 -1.3653377    1.2997189   0.6523547  0.8297530 -0.46845278 -0.3241219
## AQP3   -0.6067480    2.1996497   0.9870714  0.2866108 -0.59156893 -0.7475342
## MALAT1  0.5479293    0.5091795   0.8464071  0.8334376 -0.46049421 -0.5621087
## MYC     0.6918531    1.2562329   1.6091033  0.3373354 -0.57521155 -0.7662308
##        Mono FCGR3A    NK cell   Platelet
## CD79A  -0.35758943 -0.3419031 -0.3677019
## CD79B   0.56448664 -0.1825349 -0.5658126
## MS4A1  -0.31078578 -0.3372859 -0.2759042
## TCL1A  -0.34777379 -0.3677649 -0.4259352
## LTB    -0.31881117 -0.8799306 -1.0411381
## CD2    -0.71405995 -0.0121216 -0.9589440
## GIMAP7 -0.11671624  0.9434342 -1.4506321
## AQP3   -0.68100419 -0.4787048 -0.3677718
## MALAT1 -0.05320274  0.6045094 -2.2656573
## MYC    -0.70390444 -0.6826998 -1.1664781

Now, we can produce a basic heatmap:

Heatmap(toplot, lab_fill = "zscore")

Note that this provides a convenient method to display marker genes (or differentially expressed genes, DEGs) for each cluster, which is very practical. However, this is not the only method. Let’s also learn about Seurat’s built-in method for finding DEGs.

# Find all markers of B cells
bcell.markers <- FindMarkers(pbmc, ident.1 = "B cell", logfc.threshold = 1, only.pos = TRUE)
head(bcell.markers)
##                   p_val avg_log2FC pct.1 pct.2     p_val_adj
## CD79A      0.000000e+00   6.783307 0.934 0.043  0.000000e+00
## MS4A1      0.000000e+00   5.737084 0.860 0.052  0.000000e+00
## LINC00926 2.387605e-276   7.258980 0.562 0.009 3.971542e-272
## CD79B     2.983575e-275   4.595655 0.911 0.143 4.962879e-271
## TCL1A     8.485045e-274   6.612804 0.619 0.022 1.411402e-269
## HLA-DQA1  4.817482e-269   3.984138 0.885 0.118 8.013400e-265

This compares all B cells to non-B cells to find DEGs. If you want to compare B cells with specific clusters, you can set the ident.2 parameter.

We can also use FindAllMarkers to conveniently generate marker genes for all clusters:

pbmc.markers <- FindAllMarkers(pbmc, logfc.threshold = 2, only.pos = TRUE)
## Calculating cluster B cell
## Calculating cluster CD4 T Memory
## Calculating cluster CD4 T Naive
## Calculating cluster CD8 T cell
## Calculating cluster DC
## Calculating cluster Mono CD14
## Calculating cluster Mono FCGR3A
## Calculating cluster NK cell
## Calculating cluster Platelet
head(pbmc.markers, 10)
##                   p_val avg_log2FC pct.1 pct.2     p_val_adj cluster      gene
## CD79A      0.000000e+00   6.783307 0.934 0.043  0.000000e+00  B cell     CD79A
## MS4A1      0.000000e+00   5.737084 0.860 0.052  0.000000e+00  B cell     MS4A1
## LINC00926 2.387605e-276   7.258980 0.562 0.009 3.971542e-272  B cell LINC00926
## CD79B     2.983575e-275   4.595655 0.911 0.143 4.962879e-271  B cell     CD79B
## TCL1A     8.485045e-274   6.612804 0.619 0.022 1.411402e-269  B cell     TCL1A
## HLA-DQA1  4.817482e-269   3.984138 0.885 0.118 8.013400e-265  B cell  HLA-DQA1
## VPREB3    6.983605e-238   6.957038 0.484 0.007 1.161653e-233  B cell    VPREB3
## HLA-DQB1  3.947592e-234   4.054111 0.862 0.146 6.566424e-230  B cell  HLA-DQB1
## CD74      1.691378e-188   3.004657 1.000 0.818 2.813439e-184  B cell      CD74
## HLA-DRA   1.834090e-187   2.864834 1.000 0.490 3.050826e-183  B cell   HLA-DRA

Customizing the Heatmap

Modifying Axis and Labels

Sometimes, the first name on the x-axis might be too long and exceed the left boundary of the plot. To prevent this issue and ensure all labels are fully visible, you can increase the space on the left side of the plot by adjusting the plot.margin parameter:

Heatmap(toplot, lab_fill = "zscore", plot.margin = margin(l = 30))

For denser matrices, you may wish to only show a subset of gene names:

genes <- VariableFeatures(pbmc)[1:500]
toplot2 <- CalcStats(pbmc, features = genes, method = "zscore", order = "p")
Heatmap(toplot2, lab_fill = "zscore", feature_text_subset = genes[1:20], expand_limits_x = c(-0.5, 12))

Faceting the Heatmap

You can also split the heatmap based on gene groups:

gene_groups <- sample(c("group1", "group2", "group3"), nrow(toplot2), replace = TRUE)
Heatmap(toplot2, lab_fill = "zscore", facet_row = gene_groups) +
  theme(axis.text.y = element_blank())

6. Generate a Waterfall Plot

Introduction

Waterfall plots are powerful visualization tools that can display differences between two conditions, showing gene expression, gene set enrichment, or other metrics. This function can handle inputs directly from Seurat objects or pre-processed matrices.

You can use the waterfall plot to compare expression levels of genes directly from a Seurat object, using LogFC to determine the bar length. Here’s how to do it for the top 80 variable features:

genes <- VariableFeatures(pbmc)[1:80]
WaterfallPlot(
  pbmc, group.by = "cluster", features = genes,
  ident.1 = "Mono CD14", ident.2 = "CD8 T cell", length = "logFC")

Focusing on Extremes

To further hone in on the most differentially expressed genes, you might want to keep only the top and bottom 20 genes. This can highlight the most critical differences between the two cell types:

WaterfallPlot(
  pbmc, group.by = "cluster", features = genes,
  ident.1 = "Mono CD14", ident.2 = "CD8 T cell", length = "logFC",
  top.n = 20)

Additionally, you can set parameters to modify the plot, such as filtering features based on a specific threshold, rotating the plot 90 degrees (by setting flip), and more. For detailed options, refer to the function’s help documentation.

Tips: Comparing Heatmaps, Violin Plots, and Waterfall Plots

When choosing between heatmaps, violin plots, and waterfall plots, consider the following:

  1. Heatmaps:
    • Advantage: Can compare multiple groups and a large number of features.
    • Limitation: Uses color to represent average expression levels, which may lack precision.
  2. Violin Plots:
    • Advantage: Can display multiple groups and quantitatively show expression levels.
    • Limitation: Takes up more space, limiting the number of features that can be shown compared to heatmaps.
  3. Waterfall Plots:
    • Advantage: Can display a large number of features.
    • Limitation: Focuses on comparison between only two groups.

7. Explore Color Functions

SeuratExtend provides various color themes to choose from. You can adjust color-related parameters (such as cols or col_theme) in visualization functions like DimPlot2, VlnPlot2, Heatmap, and WaterfallPlot. For more detailed information, refer to the online tutorial: https://huayc09.github.io/SeuratExtend/articles/Visualization.html#explore-color-functions

Here are some recommended color schemes for visualization:

For discrete variables, SeuratExtend’s “light” color scheme:

library(cowplot)
plot_grid(
  DimPlot2(pbmc, cols = "light", theme = NoAxes() + NoLegend()),
  ClusterDistrBar(pbmc$sample, pbmc$cluster, cols = "light", flip = FALSE, border = "black", reverse_order = FALSE) +
    theme(axis.title.x = element_blank())
)

For continuous variables, the “A” or “D” color schemes from viridis:

Heatmap(toplot, lab_fill = "zscore", color_scheme = "A")

WaterfallPlot(
  pbmc, group.by = "cluster", features = genes,
  ident.1 = "Mono CD14", ident.2 = "CD8 T cell", length = "logFC",
  top.n = 20, color_theme = "D")

You can also use various color palettes from RColorBrewer (https://r-graph-gallery.com/38-rcolorbrewers-palettes.html):

library(RColorBrewer)
markers_genes <- c(
  "MS4A1", "GNLY", "CD3E",
  "CD8A", "CCR7", "CD14",
  "FCER1A", "FCGR3A", "PPBP")
DimPlot2(
  pbmc,
  features = markers_genes,
  cols = brewer.pal(9, "GnBu"),
  theme = NoAxes(), pt.size = 0.3)

DimPlot2(
  pbmc,
  features = markers_genes,
  cols = rev(brewer.pal(11,"Spectral")),
  theme = NoAxes(), pt.size = 0.3)

SeuratExtend will soon be updated to include more built-in color schemes for selection. Stay tuned for these exciting additions!

8. Mastering ggplot2 for Custom Visualizations

Introduction to ggplot2

While SeuratExtend provides many convenient functions for visualization, all of these are built on top of ggplot2. Understanding ggplot2 can give you more flexibility in customizing your plots and creating entirely new visualizations. In this section, we’ll cover the basics of ggplot2 and how to use it with Seurat data.

The Grammar of Graphics

ggplot2 is based on the Grammar of Graphics, a layered approach to creating visualizations. The basic components are:

  1. Data: The dataset you want to visualize
  2. Aesthetics: How your data maps to visual properties (position, color, size, etc.)
  3. Geometries: The actual shapes used to represent your data (points, lines, bars, etc.)
  4. Facets: Optional subdivision of the plot into multiple plots
  5. Statistics: Optional statistical transformations of the data
  6. Coordinates: The space on which the data will be plotted
  7. Themes: All non-data ink

Creating a Basic Plot from Seurat Data

Let’s start by creating the most basic UMAP plot using ggplot2:

# Extract data from the Seurat object
umap_data <- FetchData(pbmc, vars = c("umap_1", "umap_2", "cluster", "CD3D", "sample"))
head(umap_data)
##                       umap_1     umap_2      cluster     CD3D  sample
## AAACATACAACCAC-1   2.8657047   4.409265  CD4 T Naive 2.863463 sample1
## AAACATTGAGCTAC-1   6.4355502 -13.392225       B cell 0.000000 sample1
## AAACATTGATCAGC-1   0.5606911   1.904406 CD4 T Memory 3.489090 sample1
## AAACCGTGCTTCCG-1 -10.5961605  -3.310064  Mono FCGR3A 0.000000 sample1
## AAACCGTGTATGCG-1   8.9127020   3.276543      NK cell 0.000000 sample1
## AAACGCACTGGTAC-1   0.7217073   5.876705  CD4 T Naive 1.726522 sample1
# Create the most basic plot
ggplot(umap_data, aes(x = umap_1, y = umap_2, color = CD3D)) +
  geom_point()

This is the most basic plot we can create. It shows the UMAP coordinates of our cells, colored by CD3D expression. However, it’s not very visually appealing or informative in its current state. Let’s improve it step by step:

# Create an improved plot
library(viridis)
ggplot(umap_data, aes(x = umap_1, y = umap_2, color = CD3D)) +
  geom_point(size = 0.5, alpha = 0.7) +
  scale_color_viridis() +
  theme_minimal() +
  labs(title = "UMAP colored by CD3D expression",
       x = "UMAP_1", y = "UMAP_2")

Let’s break down the improvements we made:

  1. We adjusted the size and transparency of the points using size = 0.5, alpha = 0.7 in geom_point(). This helps to visualize dense areas better.
  2. We used scale_color_viridis() to apply a colorblind-friendly, perceptually uniform color scale.
  3. We applied theme_minimal() to remove unnecessary plot elements and give it a cleaner look.
  4. We added a title and axis labels using labs().

These small changes significantly improve the readability and aesthetics of our plot. In the following sections, we’ll explore even more ways to customize and enhance our visualizations.

Customizing the Plot

Now let’s customize our plot further:

ggplot(umap_data, aes(x = umap_1, y = umap_2, color = CD3D)) +
  geom_point(size = 0.5, alpha = 0.7) +
  scale_color_viridis(option = "A") +
  facet_wrap(~sample, ncol = 3) +
  theme_bw() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        strip.background = element_rect(fill = "white"),
        strip.text = element_text(face = "bold")) +
  labs(title = "UMAP by cluster, colored by CD3D expression",
       x = "UMAP_1", y = "UMAP_2")

In this version:

  1. We’ve used facet_wrap() to create separate plots for each cluster.
  2. We’ve customized the theme further using theme() to remove axis text, ticks, and grid lines.
  3. We’ve styled the facet labels with a white background and bold text.

Combining Multiple Plots

Often, we want to combine multiple plots into a single figure. The cowplot package is great for this:

library(cowplot)

# Create two plots
plot1 <- ggplot(umap_data, aes(x = umap_1, y = umap_2, color = cluster)) +
  geom_point(size = 0.5, alpha = 0.7) +
  scale_color_manual(values = color_pro(9,2)) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "Clusters")

plot2 <- ggplot(umap_data, aes(x = umap_1, y = umap_2, color = CD3D)) +
  geom_point(size = 0.5, alpha = 0.7) +
  scale_color_viridis_c(option = "A") +
  theme_minimal() +
  labs(title = "CD3D Expression")

# Combine plots
combined_plot <- plot_grid(plot1, plot2, labels = c("A", "B"), ncol = 2)

# Add a title to the combined plot
title <- ggdraw() + 
  draw_label("UMAP Visualization of PBMC Data", fontface = "bold", x = 0, hjust = 0) +
  theme(plot.margin = margin(0, 0, 0, 7))

plot_grid(title, combined_plot, ncol = 1, rel_heights = c(0.1, 1))

Saving Your Plots

Finally, let’s save our plot using ggsave():

dir.create("results", showWarnings = FALSE)
ggsave("results/pbmc_umap_plot.png", combined_plot, width = 8, height = 4.5, dpi = 300)

This will save the plot as a high-resolution PNG file.

Tips for Further Customization

Remember, while ggplot2 offers extensive customization options, it can sometimes be complex. Don’t hesitate to consult the ggplot2 documentation, online resources, or AI assistants like Claude or ChatGPT for help with specific customizations.

This concludes our lesson on advanced visualization techniques. You now have a solid foundation in creating and customizing plots for your single-cell RNA-seq data analysis. In the next lesson, we’ll explore more advanced analytical techniques to further your understanding of single-cell data.